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Abstract 

In this work we first propose a heteroscedastic generalization to 
RVM, a fast Bayesian framework for regression, based on some re- 
cent similar works. We use variational approximation and expectation 
propagation to tackle the problem. The work is still under progress 
and we are examining the results and comparing with the previous 
works. 



1 Introduction 



The Relevance Vector Machines (RVM) |Tipping, 2001 are among the most 



popular fast Bayesian frameworks. In this model it is assumed that the out- 
put are contaminated with a Gaussian noise with constant variance. While 
simple and effective in some applications this assumption is not sufficient, 
and we need to assume variable variance noise processes, i.e. heteroscedas- 
tic models. Gaussian processes (GP) provide a rich nonparametric tools for 
modelling. There are interesting similarities between GP models and RVM; 
in [?, MacKay et al., 19 97 it is shown that infinite equally spaced Gaussian 
kernel functions will converge to a Gaussian process model. Thus one may 
think of RVM fast sparsified GP. 

Due to richness and flexibility of the GP models, there has been a surge of 
interests in using these model for different scenarios, especially for modelling 

as- 



output-heteroscedastic dataQ A ground work by |Goldberg et al., 1997 



sumes a model consisting of a Gaussian process for real prediction and the 



1 Note that heteroscedasticity could be both in input or output, and also note that 
uncertainty is also different from heteroscedasticity. Examples of uncertainty uncertainty 
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noise process (on the output), and evaluates the model using MCMC. Similar 
models are proposed in (Q uadrianto et al., 2009 and Kersting et al., 2007 



which have slow convergence. In |Lazaro-Gredilla and Titsias, 2011] an output- 
heteroscedastic GP via GP noise is introduced which uses a variational ap- 
proximation to train. In the same model in |Munoz-Gonzalez et al., 2011 the 



approximate training is being done using Expectation Propagation Minka, 2001 



In this work follow a similar modelling to |Munoz- Gonzalez et al., 2011 



and Lazaro-Gredilla and Titsias, 2011] to impose heteroscedasticity in RVM 
using a GP model noise. Due to sparsification properties of RVM model over 
GP, our model has faster convergence comparing to |Munoz- Gonzalez et al., 2011 



Lazaro-Gredilla and Titsias, 2011] . The simplified variational approximation 



simplification which is known as collapsed variational approximation is first 
introduced in |King and Lawrence, 2006 Teh et al., 2007 and analytically 
analyzed in Hensman et al., 2012| . We implement the model on several ap- 
plications and compare the practical result to several existing models. 



2 The learning model 

Consider we have the data-set V = {xj, Ui]^ =1 = (X, y), and y is the model's 
scalar output, x is the input vector with length Q. Assume the following 
linear model similar to |Tipping, 2001 , 

M 

y(x) = ^Wi0i(x) + e = $(x)w + e 

i=l 

in which w = [w\, . . . , wm] T is the vector of weights and $(x) = [0i(x), . . . , </>m(x)] 
is the vector of kernel function with length M. We also define N x M design 
matrix as $ = [$(xi) T . . . $(xtv) t ] which will come handy later. We also 
put normal distribution over the weights, 

M M 

p(w\a) = Y[p(wi\oi) = Y[M {wi\0, a' 1 ) = M (w|0, A" 1 ) . 

i=l i=l 

which are not heteroscedastic are |Tresp et al., 1994] | Wright, 1999| |Dallaire et al., 20Tf] 
McHutchon and Rasmussen, 2011 . In addition one other incentive to create output- 
heteroscedastic models, are input-uncertain data, i.e. one could approximate input noise 
by taking its effect in the variable-variance noise in the output into account, i.e. via 
output-heteroscedasity. 
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where A = diag {ai, . . . , The main difference of the above formulation 
with the one in |Tipping, 20 01 is that in the above model, the noise term has 



a variable variance which is modelled with a GPR similar to |Lazaro-Gredilla and Titsias, 2011 
e~Af(0,e^), <7(x)~0P(MK ff ). 
p (f n = $ (x n ) w| A) = N (0, $(x n ) A- 1 $ T (x„)) p(f) = jV(0,K/) . 

The hyper-parameters to be learnt in this model are O = {A,6 g }, in which 
6 g is set of parameters in the covariance function of <?(x). The likelihood is, 



A' 



n=l 



e9«V27r 



exp 



(j/ n - $(x n )w) 



exp 



E 



N 

n=l 



(27T) 



AT/2 



exp 



1 * 

4e 



(y n - $(x n )w) ; 



n=l 



AT(y|M„S, 



in which S ; = diag{e 291 , . . . , e 29N } and /x ; = [<&(xi)w, 
The posterior for each f and g is, 



,$(xjv)w] 



T 



(la) 

(lb) 
(lc) 



(2) 
(3) 



p(g|w,2?,6) <xM(jjLol,K g ) x C 

p(w|g, V,Q) ocAf (0, A" 1 ) x £ 

Note that the above likelihood is only Gaussian with respect to w, not 
g. Considering that both priors are Gaussian, this causes the posterior of 
g in Eqj2] to become non-Gaussian, and need to approximated to become 
tractable. 

Also note that defining f = $w, because of the deterministic relationship 
between f and w, having one, entails having the others distribution, and 
the model resulting model based on f is essentially the same as the one in 



Lazaro-Gredilla and Titsias, 2011 . To keep everything simple here instead 



of writing the Bayes relation on f , we simply write it on w. 



3 Approximate learning 

In RVM training the parameters is achieved via maximizing the marginal 
likelihood, p(y|X, A, a 2 ), in which a 2 is a constant variance. While in out 
setting the variance e 9< ^ is not constant, and it not determined. 
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In this section we formulate the approximations to tackle the problem 
at hand. We'll first introduce the variational methods and then expectation 
propagation method. 



3.1 Variational Approximation 

In variable approximation we approximate the posterior distribution by fac- 
torizing the joint distribution of target parameters into independent distri- 
butions. We then lower bound the marginal likelihood, henceforth instead 
of maximizing the posterior, we maximize the lower bound. A comprehen- 
sive review of variational methods for several applications could be found in 
Wainwright and Jordan, 2008 . We assume our approximate factorization 



as follows, 



p(w,g|y) « g(w,g) = q(w).q(t 



The optimization of the above independence approximation could be done 
using Kullback-Leibler divergence, KL (q (w) .q (g) \\p (w, g|y)), which at the 
same time maximizes the lower bound on the marginal likelihood, 

/p(w, g, y) 
q(g)q(w) log -jzhiZX dwd £ = ? (<?( w )' 9(g)) 



q(g)q(w) 



which can be rewritten as 



logp(y) - J= (g(w), q(g)) = KL (q (w) .q (g) \\p (w, g|y)) 
The solution to the about variational minimization is 



q + (w) oc p(w). exp 



q\g) logp(y|w,g)dg 



q t+ (g) ocp(g).exp 



q t+ (w)logp(y|w,g)dw 



As suggested in |Lazaro-Gredilla and Titsias, 2011| the above lower bound J 7 , 
could be more simplified by exploiting the structure of the problem. It turns 
out that this method was previously introduced by |King and Lawrence, 2 006, 
Teh et al., 2007 and known as collapsed variational inference. The recent 
work |Hensman et al., 2012 introduces a unified view of this model. Simi- 
larly we can analytically marginalize one variable and put in the lower bound 
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to decrease a portion of the optimization process. We analytically compute 
?(w), 



p(w) 



g (wj = argmax = 

g(w) Z (q (w)J 



cxp 



g(g) logp(y|w,g)dg 



(4) 



where Z (g (w)) = J p(w) exp [J g(g) logp(y|w, g)dg] c?w is the normalizing 
constant. Now we simplify the lower bound: 



logp(y) > T\ 



<?(w)=g*(w) 



E 



W,g~<? 



log 



P(w,g,y) 



q(w)q( 



q(w)=<j*(w) 

p(w,g,y)Z 



(5a) 



E *E 



E »E 



log 



log 



g(g)p(w) exp [E g ^ logp(y| w, g)] 

(5b) 

p(y|w, g)p(w)p(g)z 



g(g)p(w) 



9(g) 



E. 



g~9 



log 



Z-E 



9(g) 



g~9 



log 



?(* 



Eg~ 9 logp(y|w, 
(5c) 

(5d) 
(5e) 

Z-KL (g(g)||p(g)) = Tkl 

(5f) 



Which is what we expected, a new variational bound which is functional of 



only the distribution on g. Similar to Hensman et al., 2012 we call this new 
bound J"kl- Based on the assumptions of the problem now we can simplify 



the bound. Following Lazaro-Gredilla and Titsias, 2011 we assume that 
?(g)=JV(g|Ai,S), 



JkL = log J A/"(w|0,A x )exp J Af(g\tJ,, S)logp(y|w,g)dg 
-KL (^(g|/x,S)||^(g|/x l,K ff )) 



dw (6a) 
(6b) 



Substituting the term inside the exponential J A/"(g|/x, S) logp(y|w, g)dg = 
logA/"(y|w, R) — |tr(S) in which [H]u = e^ 4- ^"/ 2 , which makes the bound 



as follows: 

Jkl = log A/"(y|0, $A-^ T + R) - itr(S) - KL(jV(g|/i, E) ||JV(g|/x l, K fl )). 

(7) 

Similar to what is done in ILazaro-Gredilla and Titsias, 20111 one could ex- 



ploit the nature of the formulation and decrease the number of the problems 
to be optimized by using equations of = 0, and = 0, and expressing 
the unknown variables in terms of one single smaller variable A. Simplify- 
ing the update equations will result in the same equation as in appendix of 



Lazaro-Gredilla and Titsias, 2011 , but with more straightforward equation 



for updating ex. parameters, 



da J 3 tlVk- exp {[fi] k - [E] fefc /2} - i-0,(x fe ) 

3.2 Expectation Propagation Approximation 

One could use |Minka, 2001| to do iterative approximation to g's posterior. 
We need to approximate the likelihood in Eq. [Ta]into factors using a Gaussian 
family, 

p(3/n|w,g) « t yn = Z.Af(y n \fi yn ,&lJ, 

The likelihood is, 

TV TV 

C = p(y\w, g)^n ^ = M{y\fi y , S„). J] ^ = C ep 

n=l n=l 

Putting the prior and the likelihood into the Bayes formula in Eq. [2] we get 
the following relation for posterior, 

g(g|w,£>, 0) ocp(q) x C EP 

Such that g(g|w,£>,6) = W(g|^, E) in which S = (j£ g 1 + S^ 1 ) 1 and 

H = EE -1 /! The normalizing constant for the above Bayes relation, Z^p is 
the EP approximation for marginal likelihood. 

In EP at each step we delete the z-th term from the posterior to get the gap 
distribution, and then we add the corresponding exact likelihood to the gap 
distribution. Then we minimize KL(,) for the resulted non-Gaussian distri- 
bution and a Gaussian distribution, and then based on the approximated 
Gaussian, we update t Vn . 
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4 Prediction 



Note that using Eq. H]we have 

q*(w) oc A/-(0, A- 1 ).^(y|f, R) = Af S tt ) 

= 53 w * r R-V, £„, = (A + ^R- 1 *) -1 

Comparing the above posterior mean and variance with that of Eq.6 in 
Tipping et al., 2003 it reveals that the matrix R here acts like the a 2 l noise 



matrix in RVM. Now we can find the predictive distribution by marginalizing 
the latent variables in model. 



5 Notes on convergence 

In conventional RVM, the prior p(w\ot) acts to sparsify the model. During 
the update procedure of the model, some on — > oo, then we put the corre- 
sponding Wi = 0, and continue the training, and results in sparsity of RVM. 
In |Tipping et al., 2003| it's been showed that the local maxima of marginal 



likelihood for some values of happens in infinity. 

Assume the lower bound in EqJTJ in this bound the first part is a func- 
tion of ex and need to be optimized. Comparing this with the likelihood in 
|Tipping et al., 2003| ( Eq.7 ) we see that the factor C = a 2 l + $A _1 * T 
defined in that work is equivalent to C = R + <3>A _1 <I> T . Thus, similarly 
we could follow the Eq.15 to Eq.21 in |Tipping et al., 2003 to show that the 



local maxima of our lower bound happens when some a« go to infinity. So 
in experiments for a big enough arh, for each aii > arh we could put Wi = 
and get smaller model. 

Also note that due to sparsification properties of RVM, our model has 
faster convergence comparing to |Munoz-Gonzalez et al., 2011 Lazaro-Gredilla and Titsias, 2011 



as we calculate a. parameters analytically. 



6 Conclusion and future work 



We derived formulations of Heteroscedastic RVM which are similar to Lazaro-Gredilla and Titsias, 



and has interesting similarities with [Tipping et al., 2003 . This is a report 



of an ongoing work, and we are still experimenting and comparing methods. 
The final results will soon be published. 
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